8 research outputs found

    Benchmarking Top-K Keyword and Top-K Document Processing with T²K² and T²K²D²

    Top-k keyword and top-k document extraction are very popular text analysis techniques. Top-k keywords and documents are often computed on-the-fly, but they exploit weighted vocabularies that are costly to build. To compare competing weighting schemes and database implementations, benchmarking is customary. To the best of our knowledge, no benchmark currently addresses these problems. Hence, in this paper, we present T²K², a top-k keywords and documents benchmark, and its decision support-oriented evolution T²K²D². Both benchmarks feature a real tweet dataset and queries with various complexities and selectivities. They help evaluate weighting schemes and database implementations in terms of computing performance. To illustrate our benchmarks' relevance and genericity, we successfully ran performance tests on the TF-IDF and Okapi BM25 weighting schemes, and on different relational (Oracle, PostgreSQL) and document-oriented (MongoDB) database implementations.
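
As a rough illustration of what a top-k keyword query over a weighted vocabulary computes, here is a minimal Python sketch. It assumes a textbook TF-IDF variant (relative term frequency times the natural log of inverse document frequency), whitespace tokenization, and a keyword score summed over all documents. The abstract does not give the benchmark's exact formulas or queries, so the function name `top_k_keywords` and these choices are illustrative assumptions only.

```python
import math
from collections import Counter


def top_k_keywords(docs, k):
    """Return the k (term, score) pairs with the highest summed TF-IDF weight.

    Assumptions (not taken from the paper): whitespace tokenization,
    tf = count / document length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Sum each term's TF-IDF weight over all documents.
    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            idf = math.log(n_docs / df[term])
            scores[term] += (count / len(tokens)) * idf

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    tweets = ["apple banana", "apple cherry", "apple banana banana"]
    for term, score in top_k_keywords(tweets, 2):
        print(f"{term}\t{score:.3f}")
```

A term that occurs in every document ("apple" here) gets a weight of zero under this variant, which is why weighting schemes such as Okapi BM25 can rank the same vocabulary differently.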

    Automatic Language Identification for Romance Languages using Stop Words and Diacritics

    Automatic language identification is a natural language processing problem that aims to determine the natural language of a given piece of content. In this paper, we present a statistical method for automatic language identification of written text using dictionaries containing stop words and diacritics. We propose different approaches that combine the two dictionaries to accurately determine the language of textual corpora. This method was chosen because stop words and diacritics are highly language-specific: although some languages share similar words and special characters, they do not share all of them. We focus on Romance languages because they are very similar and usually hard to distinguish from a computational point of view. We tested our method on a Twitter corpus and a news article corpus. Both corpora consist of UTF-8 encoded text, so diacritics can be taken into account; when a text has no diacritics, only the stop words are used to determine its language. The experimental results show that the proposed method has an accuracy of over 90% for small texts and over 99.8% for large texts.
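
The idea of combining a stop-word dictionary with a diacritics dictionary can be sketched as follows. This is a minimal sketch, not the authors' method: the tiny word and character lists, the equal weighting of stop-word hits and diacritic hits, and the function name `identify_language` are all assumptions made for illustration. The abstract says several combination approaches were tried but does not specify them.

```python
import re

# Tiny illustrative dictionaries (assumed samples, far from complete).
STOP_WORDS = {
    "es": {"el", "la", "los", "y", "que", "de", "en", "un", "una", "es"},
    "fr": {"le", "la", "les", "et", "que", "de", "en", "un", "une", "est"},
    "it": {"il", "la", "gli", "e", "che", "di", "in", "un", "una", "è"},
}
DIACRITICS = {
    "es": set("áéíóúñü"),
    "fr": set("àâçéèêëîïôùûü"),
    "it": set("àèéìòù"),
}


def identify_language(text):
    """Return the language code with the most dictionary hits, or None.

    Score = number of stop-word tokens + number of diacritic characters.
    If the text has no diacritics, the score reduces to stop words alone.
    """
    lowered = text.lower()
    tokens = re.findall(r"\w+", lowered)
    scores = {}
    for lang in STOP_WORDS:
        stop_hits = sum(1 for tok in tokens if tok in STOP_WORDS[lang])
        diacritic_hits = sum(1 for ch in lowered if ch in DIACRITICS[lang])
        scores[lang] = stop_hits + diacritic_hits
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(identify_language("el niño come una manzana y bebe agua"))
    print(identify_language("le garçon mange une pomme et boit de l'eau"))
```

Short texts give few hits per language, which fits the abstract's observation that accuracy is lower for small texts than for large ones.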

    A Lagrangian Backward Air Parcel Trajectories Clustering Framework

    Many studies of atmospheric moisture paths use Lagrangian backward air parcel trajectories to determine the humidity sources for specific locations. Automatically grouping trajectories according to their geographical position simplifies and speeds up their analysis. In this paper, we propose a framework for clustering Lagrangian backward air parcel trajectories, from trajectory generation to cluster accuracy evaluation. We employ a novel clustering algorithm, called DenLAC, to cluster tropospheric air current trajectories. Our main contribution is representing each trajectory as a one-dimensional array of the directions of the position vectors of its points. We empirically test our pipeline on several Lagrangian backward trajectories initiated from Břeclav District, Czech Republic.